Results 1 - 2 of 2
1.
biorxiv; 2022.
Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2022.10.10.511571

ABSTRACT

Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences, and then fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models which can generalize to other prediction tasks. We demonstrate the scaling of GenSLMs on both GPU-based supercomputers and AI-hardware accelerators, achieving over 1.54 zettaflops in training runs. We present initial scientific insights gleaned from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, noting that their full potential on large biological data is yet to be realized.
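
The abstract does not say how genomes are turned into model input. As a minimal sketch only, assuming a codon-level (non-overlapping 3-mer) tokenization, the preprocessing step for a genomic language model might look like the following. The function names `codon_tokens`, `build_vocab`, and `encode`, and the choice of id 0 for unknown tokens, are hypothetical and are not taken from the preprint.

```python
from collections import Counter


def codon_tokens(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping 3-mers (codons).

    Assumption: codon-level tokens; trailing bases that do not fill a
    complete codon are dropped.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def build_vocab(sequences: list[str]) -> dict[str, int]:
    """Assign an integer id to every codon seen in the corpus.

    Ids start at 1; id 0 is reserved (hypothetically) for unknown tokens.
    """
    counts = Counter(tok for s in sequences for tok in codon_tokens(s))
    return {tok: i for i, tok in enumerate(sorted(counts), start=1)}


def encode(seq: str, vocab: dict[str, int]) -> list[int]:
    """Map a sequence to token ids, using 0 for codons not in the vocabulary."""
    return [vocab.get(tok, 0) for tok in codon_tokens(seq)]


# Toy corpus of made-up sequences, for illustration only.
vocab = build_vocab(["ATGGCC", "ATGTAA"])
ids = encode("ATGTAAGGG", vocab)
```

A sequence of ids like this is what a language model would consume during pre-training and fine-tuning; the real GenSLM pipeline may differ in vocabulary, special tokens, and context length.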

2.
biorxiv; 2020.
Preprint in English | bioRxiv | ID: ppzbmed-10.1101.2020.07.16.207308

ABSTRACT

In response to the COVID-19 pandemic caused by the SARS-CoV-2 virus, structural biologists are using experimental structure determination methods to better understand the viral proteome. Our goal in this work was to help researchers use these rapidly emerging structural data to gain detailed insights into the molecular mechanisms underlying COVID-19 infection. Our analysis was based on the protein sequences defined by UniProt as comprising the viral proteome. We systematically compared each SARS-CoV-2 protein sequence against all available protein 3D structures derived from any organism (164,250 PDB entries), using pairs of hidden Markov models built with the HHblits tool. We found 872 sequence-to-structure alignments whose sequence similarity was significant enough (E < 10e-10) to infer structural similarity. The resulting 872 3D template models now provide a wealth of new detail, currently not available from related resources. To help make this large, complex dataset accessible and usable for other researchers, we also developed a tailored layout strategy to visually organise the 3D models by mapping them to the viral genome. The resulting graph provides an immediate and comprehensive visual overview of what is known - and not known - about the 3D structure of the viral proteome, thereby helping direct future research. The graph also clearly reveals all available structural evidence of viral mimicry or hijacking of human proteins, as well as all evidence of interactions between viral proteins. We have created PDF and online versions of the graph, in which users can click on any node in the graph to open the corresponding 3D model in the Aquaria molecular graphics system. In Aquaria, these models can then be colored to show sequence features, such as single nucleotide polymorphisms and posttranslational modifications.
Previous versions of Aquaria showed only features from UniProt; however, as part of this study, we have now added features from PredictProtein and CATH, thus providing a total of 32,717 features for SARS-CoV-2 protein sequences. In this work, we present novel insights, obtained using the above approach, into how SARS-CoV-2 mimics and hijacks host proteins, and how viral proteins self-assemble during infection. The resulting Aquaria-COVID resource is freely available online at https://aquaria.ws/covid19, and an accompanying video (https://youtu.be/J2nWQTlJNaY) explains how researchers can use the resource.
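
The two steps described in this abstract, keeping only alignments below an E-value cutoff and then organising the surviving 3D models by position along the viral genome, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `Hit` record, its field names, and the sample values are hypothetical, and the cutoff is copied literally from the abstract's "E < 10e-10" (which Python reads as 1e-9; the intended value may be 10^-10).

```python
from dataclasses import dataclass

# Cutoff as written in the abstract; adjust if 1e-10 was intended.
E_VALUE_CUTOFF = 10e-10


@dataclass(frozen=True)
class Hit:
    """One sequence-to-structure alignment (hypothetical record layout)."""
    protein: str        # viral protein name
    genome_start: int   # start coordinate of the protein on the genome
    template_id: str    # identifier of the matched 3D structure
    e_value: float      # alignment E-value reported by the search tool


def significant_hits(hits: list[Hit], cutoff: float = E_VALUE_CUTOFF) -> list[Hit]:
    """Keep only alignments whose E-value is strictly below the cutoff."""
    return [h for h in hits if h.e_value < cutoff]


def layout_by_genome(hits: list[Hit]) -> dict[str, list[str]]:
    """Group templates per protein, proteins in genome order, best E-value first."""
    ordered = sorted(hits, key=lambda h: (h.genome_start, h.e_value))
    layout: dict[str, list[str]] = {}
    for h in ordered:
        layout.setdefault(h.protein, []).append(h.template_id)
    return layout


# Made-up example records, for illustration only.
hits = [
    Hit("protB", 200, "tmpl3", 1e-30),
    Hit("protA", 100, "tmpl1", 1e-12),
    Hit("protA", 100, "tmpl2", 1e-50),
    Hit("protA", 100, "tmpl4", 1e-3),   # too weak, filtered out
]
layout = layout_by_genome(significant_hits(hits))
```

Because Python dictionaries preserve insertion order, iterating over `layout` yields proteins in genome order, which is the property a genome-mapped overview graph would rely on.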


Subject(s)
COVID-19